1. Introduction. The form of the Bayes estimate of the population
mean with respect to a Dirichlet prior with parameter $\alpha$ has given rise to
the interpretation that $\alpha(\mathcal{X})$ is the prior sample size. Furthermore, if
$\alpha(\mathcal{X})$ is made to tend to zero, then the Bayes estimate converges
to the classical estimator, namely the sample mean. This has
further given rise to the general feeling that allowing $\alpha(\mathcal{X})$ to become
small not only makes the 'prior sample size' small but also
corresponds to no prior information. By investigating the limits of
prior distributions as the parameter $\alpha$ tends to various values, we show
that it is misleading to think of $\alpha(\mathcal{X})$ as the prior sample size and of the
smallness of $\alpha(\mathcal{X})$ as no prior information. In fact, very small values of
$\alpha(\mathcal{X})$ actually mean that we have very definite information concerning the
unknown true distribution.

2. The Dirichlet measure. Let $(\mathcal{X}, \mathcal{A})$ be a separable metric space
endowed with the corresponding Borel $\sigma$-field. Let $\mathcal{P}$ and $\mathcal{M}$ be the
classes of probability measures and finite (countably additive) measures on
$(\mathcal{X}, \mathcal{A})$. The natural $\sigma$-field, $\sigma(\mathcal{P})$, on $\mathcal{P}$ is the smallest $\sigma$-field in $\mathcal{P}$
such that the function $P \mapsto P(A)$ is measurable for each $A$ in $\mathcal{A}$. There
is also the notion of weak convergence in both $\mathcal{P}$ and $\mathcal{M}$, namely, $\alpha_r \to \alpha$
if and only if $\int g\,d\alpha_r \to \int g\,d\alpha$ for all bounded continuous functions $g$ on $\mathcal{X}$.
Under this convergence $\mathcal{P}$ becomes a separable complete metric space
(Prohorov [4]) and the $\sigma$-field $\sigma(\mathcal{P})$ above is the Borel $\sigma$-field in $\mathcal{P}$.

For each non-zero measure $\alpha$ in $\mathcal{M}$, we denote by $\bar{\alpha}$ the corresponding normalized
measure, namely $\bar{\alpha}(A) = \alpha(A)/\alpha(\mathcal{X})$, $A \in \mathcal{A}$.

In non-parametric Bayesian analysis, the 'true' probability measure
$P$ takes values in $\mathcal{P}$, is random and has a prior distribution. To facilitate
the use of standard probability theory we must view $P$ as a measurable map
from some probability space $(\Omega, \mathcal{S}, Q)$ into $(\mathcal{P}, \sigma(\mathcal{P}))$, and the induced
measure $QP^{-1}$ becomes the prior distribution. For any non-zero measure $\alpha$
in $\mathcal{M}$, the Dirichlet prior measure $D_\alpha$ with parameter $\alpha$ is defined as
follows (Ferguson [3]): For any finite measurable partition $(A_1, \ldots, A_k)$
of $\mathcal{X}$, the distribution of $(P(A_1), \ldots, P(A_k))$ under $D_\alpha$ is the singular
Dirichlet distribution $D(\alpha(A_1), \ldots, \alpha(A_k))$ defined on the $k$-dimensional
simplex as in Wilks [7] Section 7.7. Ferguson [3] used this definition
and also an alternate definition (see Theorem 1 of Ferguson [3]), and
derived many properties of Dirichlet priors and the corresponding Bayes
estimates of population parameters. Blackwell [1] and Blackwell and
MacQueen [2] have also given alternative definitions of the Dirichlet
prior. We give below yet another definition of the Dirichlet prior which
is more general than the previous ones since we will not have to assume
that $\mathcal{X}$ is separable metric. Let $\alpha$ be a non-zero measure in $\mathcal{M}$. Let
$(\Omega, \mathcal{S}, Q)$ be a probability space rich enough to support two independent
sequences of i.i.d. random variables $Y_1, Y_2, \ldots$ and $\theta_1, \theta_2, \ldots$, where
$Y_1$ is $\mathcal{X}$-valued and has distribution $\bar{\alpha}$ and $\theta_1$ is real valued and has a
Beta distribution with parameters 1 and $\alpha(\mathcal{X})$. Let $p_1 = \theta_1$, $p_2 = \theta_2(1-\theta_1)$,
$p_3 = \theta_3(1-\theta_1)(1-\theta_2)$, .... For any $y$ in $\mathcal{X}$ let $\delta_y$ stand for the degenerate
probability measure at $y$. Define the measurable map $P$ from $(\Omega, \mathcal{S})$ into
$(\mathcal{P}, \sigma(\mathcal{P}))$ as follows:

$$P(A) = \sum_{j=1}^{\infty} p_j\,\delta_{Y_j}(A). \qquad (1.1)$$

Then the induced distribution of $P$ is the Dirichlet measure $D_\alpha$ with
parameter $\alpha$. The proof of this fact, and of the fact that the standard properties
of Dirichlet measures can be deduced from it, will be given elsewhere
(Sethuraman [5]).
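The construction (1.1) translates directly into a sampler. The following is a minimal Python sketch and is not part of the paper: it truncates the infinite sum at `n_atoms` terms (an approximation), it assumes NumPy, and the names `stick_breaking_sample`, `base_sampler` and `prob_of_set` are hypothetical.

```python
import numpy as np

def stick_breaking_sample(total_mass, base_sampler, n_atoms=1000, rng=None):
    """Draw a truncated version of the random measure P in (1.1).

    total_mass   : alpha(X) > 0, the total mass of the parameter measure
    base_sampler : callable (rng, size) -> draws from the normalized measure
    n_atoms      : truncation level (an assumption; (1.1) is an infinite sum)
    """
    rng = np.random.default_rng() if rng is None else rng
    # theta_j are i.i.d. Beta(1, alpha(X))
    theta = rng.beta(1.0, total_mass, size=n_atoms)
    # remaining[j] = prod_{i<j} (1 - theta_i), with an empty product for j = 0
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - theta)[:-1]))
    weights = theta * remaining          # p_j = theta_j * prod_{i<j} (1 - theta_i)
    atoms = base_sampler(rng, n_atoms)   # Y_j i.i.d. from the normalized measure
    return weights, atoms

def prob_of_set(weights, atoms, indicator):
    """P(A) = sum_j p_j * 1{Y_j in A} for the truncated measure."""
    return float(np.sum(weights[indicator(atoms)]))

# Example: alpha = 2 * N(0, 1), so alpha(X) = 2 and the normalized measure is N(0, 1).
rng = np.random.default_rng(0)
weights, atoms = stick_breaking_sample(
    2.0, lambda r, size: r.standard_normal(size), n_atoms=2000, rng=rng
)
mass_negative = prob_of_set(weights, atoms, lambda y: y < 0)
print("total truncated mass:", weights.sum())
print("P((-inf, 0)):", mass_negative)
```

The mass omitted by the truncation is the product of the first `n_atoms` factors $(1-\theta_j)$, so it shrinks geometrically on average and is negligible here.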

In the statistical problem of non-parametric Bayesian analysis we
have a random variable $P$ taking values in $\mathcal{P}$ whose distribution is $D_\alpha$.
We also have a sample $X_1, \ldots, X_n$ of random variables taking values
in $\mathcal{X}$. Given $P$, these are i.i.d. with common distribution $P$. It is
required to estimate a function $\phi(P)$, and the Bayes estimator $\hat{\phi}$ with respect
to squared error loss is given by

$$E(\phi(P) \mid X_1, \ldots, X_n).$$

In particular, if $\phi(P) = \phi_g(P)$ where

$$\phi_g(P) = \int g(x)\,P(dx) \qquad (1.2)$$

and $g$ is a real valued measurable function on $\mathcal{X}$ with $\int g^2\,d\alpha < \infty$, then the
Bayes estimate is given by

$$\frac{\alpha(\mathcal{X}) \int g\,d\bar{\alpha} + n \int g\,dF_n}{\alpha(\mathcal{X}) + n} \qquad (1.3)$$

where $F_n$ is the empirical d.f. of $X_1, \ldots, X_n$ (Ferguson [3]). If we let
$\alpha(\mathcal{X}) \to 0$ in (1.3), we obtain the classical estimate $\int g\,dF_n$. Also, the denominator
in this estimate is $\alpha(\mathcal{X}) + n$, which is $\alpha(\mathcal{X})$ plus the sample size. These
facts have given rise to the interpretation that $\alpha(\mathcal{X})$ is the prior sample
size and that allowing $\alpha(\mathcal{X})$ to tend to zero corresponds to no prior information.
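The arithmetic behind this interpretation can be seen in a short Python sketch of the estimate (1.3). It is illustrative only: the function name `bayes_mean_estimate` and the argument `prior_mean_g` (standing for the integral of $g$ with respect to the normalized measure) are hypothetical, and NumPy is assumed.

```python
import numpy as np

def bayes_mean_estimate(sample, g, total_mass, prior_mean_g):
    """Weighted average of the prior mean of g and the empirical mean of g.

    total_mass   : alpha(X)
    prior_mean_g : integral of g with respect to the normalized measure
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    empirical_mean_g = float(np.mean(g(sample)))  # integral of g dF_n
    return (total_mass * prior_mean_g + n * empirical_mean_g) / (total_mass + n)

sample = [1.0, 2.0, 4.0]
for mass in (10.0, 1.0, 1e-6):
    estimate = bayes_mean_estimate(sample, lambda x: x, mass, prior_mean_g=0.0)
    print(f"alpha(X) = {mass:g}: estimate = {estimate:.6f}")
```

As `total_mass` shrinks, the estimate approaches the sample mean of `g`. The formula alone does not show what the prior itself looks like for small total mass.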

In the next section we investigate what happens to Dirichlet measures
when their parameters are allowed to converge to certain values. In
Section 4 we investigate what happens to Bayes estimates when the parameters
of the corresponding Dirichlet priors are allowed to converge to the zero
measure. From the results in these two sections it follows that small
values of $\alpha(\mathcal{X})$ actually correspond to certain definite information about $P$.

3. Convergence of Dirichlet measures. In this section we study
the convergence of Dirichlet measures as their parameter is allowed to
converge in appropriate ways. Since $(\mathcal{P}, \sigma(\mathcal{P}))$ is a separable complete
metric space endowed with its Borel $\sigma$-field, we can talk about the usual
weak convergence of probability measures on $(\mathcal{P}, \sigma(\mathcal{P}))$ and of Dirichlet
measures in particular.




THEOREM 3.1. Let $\{\alpha_r\}$ be a sequence of measures in $\mathcal{M}$ and let the
sequence of normalized measures $\{\bar{\alpha}_r\}$ be tight. Then the sequence $\{D_{\alpha_r}\}$ of
Dirichlet measures is tight.

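PROOF. Fix $\varepsilon > 0$. There exists a sequence of compact sets $K_d$ in $\mathcal{X}$
such that

$$\sup_r \bar{\alpha}_r(K_d^c) \le \frac{6\varepsilon}{d^3 \pi^2}, \qquad (3.1)$$

$d = 1, 2, \ldots$. Let

$$M_d = \{P : P(K_d^c) \le 1/d\}, \qquad (3.2)$$

$d = 1, 2, \ldots$, and let

$$M = \bigcap_d M_d. \qquad (3.3)$$

Then $M$ is a compact subset of $\mathcal{P}$ in the weak topology, since it is closed and
tight. Now, by the Chebyshev inequality,

$$D_{\alpha_r}(M_d^c) \le d\,E_{D_{\alpha_r}}(P(K_d^c)) = d\,\bar{\alpha}_r(K_d^c) \le \frac{6\varepsilon}{\pi^2 d^2}, \qquad (3.4)$$

and hence

$$D_{\alpha_r}(M^c) \le \sum_d \frac{6\varepsilon}{\pi^2 d^2} = \varepsilon \quad \text{for all } r. \qquad (3.5)$$

This proves that $\{D_{\alpha_r}\}$ is tight. □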


THEOREM 3.2. Let $\{\alpha_r\}$ be a sequence of measures in $\mathcal{M}$ such that

$$\sup_A |\alpha_r(A) - \alpha_0(A)| \to 0 \qquad (3.6)$$

where $\alpha_0$ is a non-zero measure in $\mathcal{M}$. Then $D_{\alpha_r}$ converges to $D_{\alpha_0}$ weakly.

PROOF. The proof of this result rests heavily on the constructive
definition of the Dirichlet measure in (1.1) and the following result,
which is proved in Sethuraman [6].

Let $\{\beta_r\}$ be a sequence of probability measures on an arbitrary measurable
space $(\mathcal{Y}, \mathcal{B})$ and let

$$\sup_B |\beta_r(B) - \beta_0(B)| \to 0, \qquad (3.7)$$

where $\beta_0$ is a probability measure on $(\mathcal{Y}, \mathcal{B})$. Then there exists a sequence
of $\mathcal{Y}$-valued random variables $\{Y_r\}_0^\infty$ with marginal distributions $\{\beta_r\}_0^\infty$ such
that

$$\mathrm{Prob}(Y_r \ne Y_0) \to 0 \quad \text{as } r \to \infty. \qquad (3.8)$$

From (1.1) and the above result, we can find independent sequences of i.i.d.
random variables $\{Y_j^r\}$, $\{\theta_j^r\}$, $r = 0, 1, 2, \ldots$, such that the distribution
of $Y_j^r$ is $\bar{\alpha}_r$, the distribution of $\theta_j^r$ is Beta with parameters 1 and $\alpha_r(\mathcal{X})$,
$r = 0, 1, \ldots$, and

$$\mathrm{Prob}(Y_j^r \ne Y_j^0) \to 0 \qquad (3.9)$$

and

$$\mathrm{Prob}(\theta_j^r \ne \theta_j^0) \to 0 \quad \text{as } r \to \infty, \; j = 1, 2, \ldots. \qquad (3.10)$$

Furthermore, if $p_1^r = \theta_1^r$, $p_j^r = \theta_j^r(1-\theta_{j-1}^r) \cdots (1-\theta_1^r)$ for $j > 1$, and

$$P^r(A) = \sum_{j=1}^{\infty} p_j^r\,\delta_{Y_j^r}(A), \qquad (3.11)$$

then the distribution of $P^r$ is the Dirichlet measure $D_{\alpha_r}$, $r = 0, 1, \ldots$.
From (3.11) it can be easily shown that, for any integer $m$,

$$\sup_A |P^r(A) - P^0(A)| \le \sum_{j=1}^{m} |p_j^r - p_j^0| + \sum_{j=1}^{m} I(Y_j^r \ne Y_j^0) + \prod_{j=1}^{m} (1-\theta_j^r) + \prod_{j=1}^{m} (1-\theta_j^0). \qquad (3.12)$$

From the construction above and (3.8), (3.9) and (3.12), by first
choosing $m$ appropriately and then allowing $r$ to tend to $\infty$, it follows that
$\sup_A |P^r(A) - P^0(A)| \to 0$ in probability, which is a stronger assertion than the one
made in the theorem, namely that $D_{\alpha_r} \to D_{\alpha_0}$ weakly. □
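A bound of the form (3.12) can be obtained by splitting the sum in (3.11) at $m$. The derivation below is a sketch of one possible argument, not necessarily the authors' own. It uses the fact that the weights sum to one almost surely, so that $\sum_{j>m} p_j^r = \prod_{j=1}^{m}(1-\theta_j^r)$, together with $p_j^0 \le 1$.

```latex
\begin{align*}
|P^r(A) - P^0(A)|
  &\le \sum_{j=1}^{m} \bigl| p_j^r\,\delta_{Y_j^r}(A) - p_j^0\,\delta_{Y_j^0}(A) \bigr|
       + \sum_{j>m} p_j^r + \sum_{j>m} p_j^0 \\
  &\le \sum_{j=1}^{m} |p_j^r - p_j^0|
       + \sum_{j=1}^{m} p_j^0\, I(Y_j^r \ne Y_j^0)
       + \prod_{j=1}^{m} (1-\theta_j^r) + \prod_{j=1}^{m} (1-\theta_j^0) \\
  &\le \sum_{j=1}^{m} |p_j^r - p_j^0|
       + \sum_{j=1}^{m} I(Y_j^r \ne Y_j^0)
       + \prod_{j=1}^{m} (1-\theta_j^r) + \prod_{j=1}^{m} (1-\theta_j^0).
\end{align*}
```

The right-hand side does not depend on $A$, so the same bound holds for the supremum over $A$.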


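THEOREM 3.3. Let $\{\alpha_r\}$ be a sequence of measures in $\mathcal{M}$ such that

$$\alpha_r(\mathcal{X}) \to 0 \quad \text{and} \quad \sup_A |\bar{\alpha}_r(A) - \alpha_0(A)| \to 0 \quad \text{as } r \to \infty, \qquad (3.13)$$

where $\alpha_0$ is a probability measure in $\mathcal{P}$. Then the measures $D_{\alpha_r}$ converge to the
distribution of a random degenerate measure $\delta_{Y^0}$, where $Y^0$ has distribution $\alpha_0$.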

PROOF. As before we can construct independent sequences of i.i.d.
random variables $\{Y_j^r\}$ and $\{\theta_j^r\}$, and an independent random variable $Y^0$,
such that $Y_1^r$ has distribution $\bar{\alpha}_r$, $Y^0$ has distribution $\alpha_0$, the distribution
of $\theta_1^r$ is Beta with parameters 1 and $\alpha_r(\mathcal{X})$, $r = 1, 2, \ldots$, and

$$\mathrm{Prob}(Y_1^r \ne Y^0) \to 0 \quad \text{as } r \to \infty. \qquad (3.14)$$

Furthermore, if $p_1^r = \theta_1^r$, $p_j^r = \theta_j^r(1-\theta_{j-1}^r) \cdots (1-\theta_1^r)$ for $j > 1$, and

$$P^r(A) = \sum_{j=1}^{\infty} p_j^r\,\delta_{Y_j^r}(A), \qquad (3.15)$$

then the distribution of $P^r$ is the Dirichlet measure with parameter $\alpha_r$,
$r = 1, 2, \ldots$.

From (3.15), it is easily seen that

$$\sup_A |P^r(A) - \delta_{Y^0}(A)| \le I(Y_1^r \ne Y^0) + 2(1-p_1^r). \qquad (3.16)$$

Since $\alpha_r(\mathcal{X}) \to 0$, the Beta variable $p_1^r = \theta_1^r$ has mean $1/(1+\alpha_r(\mathcal{X})) \to 1$, so
$1 - p_1^r \to 0$ in probability. From this and (3.14) it follows that
$\sup_A |P^r(A) - \delta_{Y^0}(A)| \to 0$ in probability, which again is stronger than the
assertion of the theorem. □


From Theorem 3.3 it is clear that allowing $\alpha_r(\mathcal{X})$ to tend to zero does
not correspond to no information on $P$. In fact, if $\alpha_r(\mathcal{X}) \to 0$ and the
normalized measure $\bar{\alpha}_r$ converges in the strong sense of (3.13) to a probability
measure $\alpha_0$, then the information about $P$ is that it is a probability measure
concentrated at a particular point in $\mathcal{X}$ which is chosen at random according
to $\alpha_0$. This is definitely very strong information about $P$, and most probably
not of the type any statistician would be willing to assume.
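The concentration on a single point can be illustrated numerically. The Python sketch below is not from the paper; it assumes NumPy, and the function name `first_weight_summary` and the threshold 0.99 are arbitrary choices. It simulates the first weight $p_1 = \theta_1$, which is Beta with parameters 1 and $\alpha(\mathcal{X})$, and reports how much of the total mass of $P$ sits on the single atom $Y_1$.

```python
import numpy as np

rng = np.random.default_rng(1)

def first_weight_summary(total_mass, n_draws=100_000):
    """Monte Carlo summary of p_1 = theta_1 ~ Beta(1, alpha(X))."""
    p1 = rng.beta(1.0, total_mass, size=n_draws)
    # Mean of p_1, and the fraction of draws in which one atom carries > 99% of the mass
    return float(p1.mean()), float(np.mean(p1 > 0.99))

for mass in (5.0, 1.0, 0.1, 0.001):
    mean_p1, frac_near_one = first_weight_summary(mass)
    print(
        f"alpha(X) = {mass:g}: mean p_1 ~ {mean_p1:.3f} "
        f"(exact {1.0 / (1.0 + mass):.3f}), P(p_1 > 0.99) ~ {frac_near_one:.3f}"
    )
```

For small total mass almost every draw of $P$ is, up to a negligible remainder, a point mass at one randomly chosen location.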

4. Convergence of Bayes estimates. In this section we are mainly
interested in the limits of Bayes estimates of various functions $\phi(P)$ as
$\alpha(\mathcal{X}) \to 0$. We will therefore make the following assumption throughout this
section:

$$\alpha_r(\mathcal{X}) \to 0 \quad \text{and} \quad \sup_A |\bar{\alpha}_r(A) - \alpha_0(A)| \to 0, \qquad (4.1)$$

where $\alpha_0$ is a probability measure in $\mathcal{P}$. We will also be mainly concerned
with a special class of functions $\phi(P)$ as defined below. Let $g$ be a
permutation invariant measurable function from $\mathcal{X}^k$ into $R^1$ such that

$$\int |g(x_1, \ldots, x_1, x_2, \ldots, x_2, \ldots, x_m, \ldots, x_m)|\,d\alpha(x_1) \cdots d\alpha(x_m) < \infty \qquad (4.2)$$

for all possible combinations of arguments $(x_1, \ldots, x_1, x_2, \ldots, x_2, \ldots,
x_m, \ldots, x_m)$ from all distinct ($m = k$) to all identical ($m = 1$). When
the function $g$ vanishes whenever any two coordinates are equal, condition
(4.2) reduces to the simple condition

$$\int |g(x_1, \ldots, x_k)|\,d\alpha(x_1) \cdots d\alpha(x_k) < \infty. \qquad (4.3)$$


Define the parametric function

$$\phi_g(P) = \int g(x_1, \ldots, x_k)\,dP(x_1) \cdots dP(x_k) \qquad (4.4)$$

for all those $P$'s for which it exists. Let $P$ have $D_\alpha$ as the prior
distribution and let $(X_1, \ldots, X_n)$ be a sample from $P$. Under further assumptions
concerning the second moment of $g$ under $\alpha$, the Bayes estimate (with respect
to squared error loss) of $\phi_g(P)$ based on the sample is

$$\hat{\phi}^n_{g,\alpha} = E_{D_\alpha}(\phi_g(P) \mid X_1, \ldots, X_n), \qquad (4.5)$$

and based on no sample is

$$\hat{\phi}^0_{g,\alpha} = E_{D_\alpha}(\phi_g(P)). \qquad (4.6)$$

Since the conditional distribution of $P$ given $(X_1, \ldots, X_n)$ is $D_{\alpha + nF_n}$,
where $F_n$ is the empirical distribution function of $(X_1, \ldots, X_n)$, we have

$$\hat{\phi}^n_{g,\alpha} = \hat{\phi}^0_{g,\alpha + nF_n}. \qquad (4.7)$$

Suppose that we substitute $\alpha = \alpha_r$ where $\{\alpha_r\}$ satisfies (4.1). From the
results of Section 3 we know that

$$D_{\alpha_r} \to \delta_{Y^0} \quad \text{weakly} \qquad (4.8)$$

and

$$D_{\alpha_r + nF_n} \to D_{nF_n} \qquad (4.9)$$

as $r \to \infty$. The main result of this section pertains to the convergence of
the Bayes estimates $\hat{\phi}^0_{g,\alpha_r}$ and $\hat{\phi}^0_{g,\alpha_r + nF_n}$.


THEOREM 4.1. Let condition (4.1) hold. Let $g$ be a continuous function
from $\mathcal{X}^k$ into $R^1$. Let $g(x_1, \ldots, x_1, x_2, \ldots, x_2, \ldots, x_m, \ldots, x_m)$ be
uniformly integrable with respect to $\bar{\alpha}_r$, for all combinations of arguments
$(x_1, \ldots, x_1, x_2, \ldots, x_2, \ldots, x_m, \ldots, x_m)$ from all distinct to all
identical. Then

$$\hat{\phi}^0_{g,\alpha_r} \to \int g(x, \ldots, x)\,d\alpha_0(x) \qquad (4.10)$$

and

$$\hat{\phi}^0_{g,\alpha_r + nF_n} \to \hat{\phi}^0_{g,nF_n} = E_{D_{nF_n}}(g(Z_1, \ldots, Z_k)) \qquad (4.11)$$

where $(Z_1, \ldots, Z_k)$ is a sample from $P$ and $P$ has the distribution $D_{nF_n}$.

PROOF. The easiest way to prove this result is to use the
representation (1.1) for the random probability measure $P$ with a Dirichlet
distribution. The uniform integrability conditions on $g$ with respect to
$\bar{\alpha}_r$ immediately show that $\phi_g(P^r)$ is uniformly integrable with respect to $D_{\alpha_r}$,
since it is the convex combination of uniformly integrable functions as
given below:

$$\phi_g(P^r) = \sum_{j_1, \ldots, j_k} p^r_{j_1} \cdots p^r_{j_k}\, g(Y^r_{j_1}, \ldots, Y^r_{j_k}),$$

where $Y_1^r, Y_2^r, \ldots$ are i.i.d. with common distribution $\bar{\alpha}_r$. This fact and (4.8)
and (4.9) establish the results (4.10) and (4.11) of the theorem. □


The results of this theorem generalize those of Ferguson [3] Sections
5b and 5e and Yamato [8], [9]. Also, when $g(x_1, \ldots, x_k)$ is such that it
vanishes whenever two coordinates are equal, it is easy to see that

$$\hat{\phi}^0_{g,nF_n} = \frac{n(n-1)\cdots(n-k+1)}{n(n+1)\cdots(n+k-1)}\,U_{g,n},$$

where $U_{g,n}$ is the usual U statistic based on $g$ and the sample $(X_1, \ldots, X_n)$.
This result is also contained in Yamato [8], [9].


